---
title: Deleting Checkpoints
description: Learn how to automatically delete low-performing model checkpoints
---

Training jobs can run for thousands of steps, and each step generates a new model checkpoint. For most training runs, these checkpoints are LoRAs that take up 80-150MB of disk space each. To reduce storage overhead, you can set up automatic deletion of all but your best-performing and most recent checkpoints.

## Deleting low-performing checkpoints

To delete all but the most recent and best-performing checkpoints of a model, call the `delete_checkpoints` method as shown below.

```python
import art
# also works with LocalBackend and SkyPilotBackend
from art.serverless.backend import ServerlessBackend

model = art.TrainableModel(
    name="agent-001",
    project="checkpoint-deletion-demo",
    base_model="OpenPipe/Qwen3-14B-Instruct",
)
backend = ServerlessBackend()
# in order for the model to know where to look for its existing checkpoints,
# we have to point it to the correct backend
await model.register(backend)

# deletes all but the most recent checkpoint
# and the checkpoint with the highest val/reward
await model.delete_checkpoints()
```

By default, `delete_checkpoints` ranks existing checkpoints by their `val/reward` score and deletes all but the highest-performing and most recent. To rank by a different metric, pass its name as `best_checkpoint_metric`.

```python
await model.delete_checkpoints(best_checkpoint_metric="train/eval_1_score")
```

Keep in mind that once checkpoints are deleted, they generally cannot be recovered, so use this method with caution.
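
Before deleting anything, it can help to reason about which steps would survive. The following is a minimal sketch of the retention rule described above (keep the most recent checkpoint and the one with the best metric), written in plain Python. It is not the library's implementation: `steps_to_keep` and the sample metric values are hypothetical, and skipping steps that have no recorded metric is an assumption.

```python
from typing import Dict, Optional, Set


def steps_to_keep(metric_by_step: Dict[int, Optional[float]]) -> Set[int]:
    """Return the steps that survive: the latest one and the best-scoring one."""
    if not metric_by_step:
        return set()

    # The most recent checkpoint is always retained.
    keep = {max(metric_by_step)}

    # Assumption: steps without a recorded metric cannot be ranked, so skip them.
    scored = {step: value for step, value in metric_by_step.items() if value is not None}
    if scored:
        keep.add(max(scored, key=scored.get))
    return keep


# Hypothetical val/reward values keyed by training step.
val_reward = {1: 0.42, 2: 0.61, 3: 0.58, 4: None, 5: 0.55}

keep = steps_to_keep(val_reward)
delete = sorted(set(val_reward) - keep)
print(sorted(keep))  # [2, 5]
print(delete)        # [1, 3, 4]
```

In this example, step 2 is kept because it has the highest score and step 5 is kept because it is the latest. If the latest step were also the best, only one checkpoint would remain.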


## Deleting within a training loop

Below is a simple example of a training loop that trains a model for 50 steps before exiting. By default, the LoRA checkpoint generated by each step will automatically be saved in the storage mechanism your backend uses (in this case W&B Artifacts).

```python

import art
from art.serverless.backend import ServerlessBackend

from .rollout import rollout
from .scenarios import load_train_scenarios

TRAINING_STEPS = 50

model = art.TrainableModel(
    name="agent-001",
    project="checkpoint-deletion-demo",
    base_model="OpenPipe/Qwen3-14B-Instruct",
)
backend = ServerlessBackend()
await model.register(backend)


train_scenarios = load_train_scenarios()

# training loop
for step in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, scenario, step) for _ in range(8))
            for scenario in train_scenarios
        ),
        pbar_desc=f"gather(train:{step})",
    )
    # trains model and automatically persists each LoRA as a W&B Artifact
    # ~120MB per step
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=5e-5),
    )

# ~6GB of storage used by checkpoints
```

Since each LoRA checkpoint generated by this training run is ~120MB, the run will require ~6GB of storage (50 steps at ~120MB each) for the model checkpoints alone. To reduce this storage overhead, let's delete checkpoints on each step.


```python
...
# training loop
for step in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, scenario, step) for _ in range(8))
            for scenario in train_scenarios
        ),
        pbar_desc=f"gather(train:{step})",
    )
    # trains model and automatically persists each LoRA as a W&B Artifact
    # ~120MB per step
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=5e-5),
    )
    # clear all but the most recent and best-performing checkpoint on the train/reward metric
    await model.delete_checkpoints(best_checkpoint_metric="train/reward")

# ~240MB of storage used by checkpoints
```

With this change, we've reduced the total amount of storage used by checkpoints from 6GB to 240MB (two checkpoints at ~120MB each), while preserving the checkpoint that performed the best on `train/reward`.
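
The storage figures above follow from simple arithmetic, sketched below. The `storage_mb` helper is hypothetical and the ~120MB checkpoint size is the approximate figure used in this example, so treat the results as rough estimates rather than measurements.

```python
CHECKPOINT_MB = 120   # approximate LoRA checkpoint size from the example above
TRAINING_STEPS = 50


def storage_mb(num_checkpoints: int, checkpoint_mb: float = CHECKPOINT_MB) -> float:
    """Estimate total storage for a given number of retained checkpoints."""
    return num_checkpoints * checkpoint_mb


without_deletion = storage_mb(TRAINING_STEPS)  # every step's checkpoint is retained
with_deletion = storage_mb(2)                  # most recent + best-performing

print(f"{without_deletion / 1000:.1f} GB")  # 6.0 GB
print(f"{with_deletion:.0f} MB")            # 240 MB
```

The 240MB figure is the two-checkpoint case. If the most recent checkpoint is also the best-performing one, a single checkpoint would presumably remain and usage would be closer to 120MB.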
